🤔 How do we use distributions of parameters (prior or posterior) to make predictions?
Two sources of variability:
Distributional (Prior or Posterior) Variability
Sampling (Data) Variability
How certain are we about the value of the parameters?
How much will observed data typically vary around its expected value? That is, the inherent variability of the data around the true value.
\[ p(\tilde{x} \mid x) = \int p(\tilde{x} \mid \theta) \, \pi(\theta \mid x) \, d\theta \]
Probability of a new sample \(\tilde{x}\) given the current data \(x\): we average the sampling distribution over the distribution of \(\theta\).
1. draw \(\theta_i \sim \pi(\theta)\) (one sample of all the model parameters)
2. draw \(y_i \sim p(y_i \mid \theta_i)\) (a sample of data from the model implied by \(\theta_i\))
Step 1 accounts for uncertainty about the parameters \(\theta\); step 2 accounts for sampling variability in the data.
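These two steps can be sketched directly in R for a toy model (the model and numbers here are illustrative, not from the slides): prior \(\theta \sim N(0, 2)\), data model \(y \mid \theta \sim N(\theta, 1)\).

```r
set.seed(1)
n_draws <- 4000

# step 1: draw parameters from the prior (or posterior) distribution
theta <- rnorm(n_draws, mean = 0, sd = 2)

# step 2: draw a data point from the model implied by each theta
y_tilde <- rnorm(n_draws, mean = theta, sd = 1)

# the spread of y_tilde reflects BOTH sources of variability:
# Var(y_tilde) = 2^2 + 1^2 = 5, versus 1 from sampling variability alone
var(y_tilde)
```

Swapping the prior draws in step 1 for posterior draws turns this into a simulation from the posterior predictive distribution.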
Prior predictive checks allow us to simulate samples from our prior and make sure it lines up with our expectations about reality.
Let’s adjust our priors to be more realistic.
Posterior predictive checks do something similar, but simulate new data from the posterior. This lets us "look for systematic discrepancies between real and simulated data" (Gelman et al. 2004), which tells us something about model fit.
Remember, p-values tell you:
given a distribution \(\pi(\theta)\), what is the probability of observing a \(\theta\) as or more extreme than the one we did? In other words, how consistent is \(\theta\) with \(\pi(\theta)\)?
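With Monte Carlo draws, that tail probability is a one-liner. A toy sketch (the distribution and observed value below are made up for illustration):

```r
set.seed(42)
theta_draws <- rnorm(10000, mean = 0, sd = 1)  # stand-in draws from pi(theta)
theta_obs <- 1.5                               # hypothetical observed value

# one-sided "as or more extreme" tail probability
mean(theta_draws >= theta_obs)
```

This should land close to the analytic answer, `1 - pnorm(1.5)` ≈ 0.067.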
# generate samples from posterior
yrep <- posterior_predict(m1, draws = 500)
# overall mean function
overall_mu <- function(x) mean(x)
# calc for actual data
overall_mu(df$y)
[1] 0.236782
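The check then compares that observed statistic to the same statistic computed on every replicated dataset. A sketch, using a simulated stand-in for `yrep` (in practice the matrix comes from `posterior_predict()` as above, one replicated dataset per row):

```r
set.seed(7)
yrep <- matrix(rnorm(500 * 100, mean = 0.2, sd = 1), nrow = 500)  # stand-in
y_obs_mu <- 0.236782  # observed statistic from above

# the statistic for each replicated dataset (row)
overall_mu <- function(x) mean(x)
rep_mus <- apply(yrep, 1, overall_mu)

# posterior predictive p-value: share of replications at least as large
mean(rep_mus >= y_obs_mu)
```

A value near 0 or 1 would flag a systematic discrepancy between the model's replications and the data.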
mtcars Data
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
Family: gaussian
Links: mu = identity; sigma = identity
Formula: mpg ~ hp + gear
Data: mtcars (Number of observations: 32)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 18.08 3.32 11.34 24.61 1.00 4653 2268
hp -0.06 0.01 -0.08 -0.05 1.00 4534 2744
gear 3.09 0.78 1.63 4.66 1.00 4503 2274
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 3.21 0.43 2.50 4.19 1.00 3367 2587
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
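One low-tech way to read the coefficient table: plug the posterior means into the linear predictor. For instance, the implied expected mpg for a car with `hp = 110` and `gear = 4` (these are rounded point estimates, so this is only approximate; for real predictions use `fitted()` or `posterior_epred()` on the model):

```r
# posterior means from the summary table above (rounded)
b0 <- 18.08; b_hp <- -0.06; b_gear <- 3.09

# implied expected mpg at hp = 110, gear = 4
b0 + b_hp * 110 + b_gear * 4  # 23.84
```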
(more on ROPE with a brms model here)
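As a quick sketch of the ROPE idea (region of practical equivalence): compute the share of posterior draws that fall inside an interval of practically negligible effect sizes. The draws and the ±0.1 bounds below are made up; with a brms fit you would pull real draws with `as_draws_df()`.

```r
set.seed(540)
slope_draws <- rnorm(4000, mean = 0.25, sd = 0.15)  # toy posterior draws
rope <- c(-0.1, 0.1)                                # arbitrary ROPE bounds

# proportion of the posterior inside the ROPE
mean(slope_draws > rope[1] & slope_draws < rope[2])
```

A proportion near 0 says the effect is credibly outside the ROPE; near 1 says it is practically equivalent to zero.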
# s/o to ChatGPT for helping simulate the data
# libs
library(brms)
library(bayesplot)
library(tidybayes)
# sim data
set.seed(540)
n <- 200
parental_income <- rnorm(n, mean = 50, sd = 10) # income_in_k
z <- 1 + 0.05 * parental_income + rnorm(n, 0, 1)
p <- 1 / (1 + exp(-z))
passed_exam <- rbinom(n, 1, p)
df <- data.frame(parental_income, passed_exam)
# priors
priors <- c(
prior(normal(0, 5), class = "b"),
prior(normal(0, 5), class = "Intercept")
)
# fit
fit <- brm(passed_exam ~ parental_income, data = df,
family = bernoulli(), prior = priors,
seed = 123, chains = 4, cores = 4, iter = 2000)
Family: bernoulli
Links: mu = logit
Formula: passed_exam ~ parental_income
Data: df (Number of observations: 200)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.30 1.69 -2.08 4.61 1.00 2293 2106
parental_income 0.04 0.03 -0.03 0.11 1.00 2132 1965
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
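Since the model uses a logit link, the coefficients live on the log-odds scale. A quick sketch converting the rounded posterior means from the summary into a probability at the simulated income mean of 50 (approximate; `posterior_epred()` gives the full posterior of this quantity):

```r
# implied probability of passing at parental_income = 50,
# using the rounded posterior means: inverse-logit of 1.30 + 0.04 * 50
plogis(1.30 + 0.04 * 50)  # roughly 0.96
```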
# s/o to Chat GPT for helping simulate the data
# Load required libraries
library(brms)
library(bayesplot)
library(tidybayes)
# sim
set.seed(123)
n_schools <- 10
n_students <- 20
total_n <- n_schools * n_students
school <- factor(rep(1:n_schools, each = n_students))
parental_income <- rnorm(total_n, mean = 50, sd = 10) # Parental income in thousands of dollars
school_effect <- rnorm(n_schools, 0, 5)[school]
individual_error <- rnorm(total_n, 0, 5)
exam_score <- 50 + 0.5 * parental_income + school_effect + individual_error
df <- data.frame(school, parental_income, exam_score)
# priors
priors <- c(
prior(normal(0, 10), class = "b"),
prior(normal(50, 10), class = "Intercept"),
prior(exponential(1), class = "sd")
)
# fit
fit <- brm(exam_score ~ parental_income + (1 + parental_income | school),
data = df, family = gaussian(), prior = priors,
seed = 123, chains = 4, cores = 4, iter = 2000)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: exam_score ~ parental_income + (1 + parental_income | school)
Data: df (Number of observations: 200)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~school (Number of levels: 10)
                               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                      1.03      0.97     0.03     3.66 1.00     2631     1886
sd(parental_income)                0.11      0.03     0.06     0.19 1.00     1048     1628
cor(Intercept,parental_income)    -0.02      0.55    -0.94     0.93 1.01      348     1082
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 51.74 1.99 47.84 55.71 1.00 5261 2955
parental_income 0.48 0.05 0.37 0.58 1.00 2121 1788
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 4.94 0.26 4.45 5.47 1.00 4369 2284
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
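One way to see why the varying intercepts and slopes help: compare per-school no-pooling OLS fits with the fully pooled fit on the same simulated data. A base-R sketch, re-running the simulation above with its seed:

```r
set.seed(123)
n_schools <- 10; n_students <- 20
total_n <- n_schools * n_students
school <- factor(rep(1:n_schools, each = n_students))
parental_income <- rnorm(total_n, mean = 50, sd = 10)
school_effect <- rnorm(n_schools, 0, 5)[school]
individual_error <- rnorm(total_n, 0, 5)
exam_score <- 50 + 0.5 * parental_income + school_effect + individual_error
df <- data.frame(school, parental_income, exam_score)

# no pooling: one OLS slope per school
slopes <- sapply(split(df, df$school),
                 function(d) coef(lm(exam_score ~ parental_income, data = d))[2])
range(slopes)

# complete pooling: one slope for everyone
coef(lm(exam_score ~ parental_income, data = df))[2]
```

The multilevel model's partial pooling shrinks the noisy per-school slopes toward the pooled estimate, which is the behavior summarized by the `~school` hyperparameters above.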